Abstract

We present a family of gossiping algorithms whose members
share the same structure but whose performance varies as a
function of a combinatorial parameter. We show that this
parameter may be regarded as a “knob” controlling the amount
of communication parallelism that characterizes the
algorithms. We then introduce procedures to operate the knob
and choose parameters matching the number of communication
channels currently provided by the available communication
system(s). In so doing we provide a robust mechanism to tune
the production of requests for communication to the current
operational conditions of the consumers of such requests.
This can be used to achieve high performance and programmatic
avoidance of undesirable events such as message collisions.

1 Introduction 

The main character in this text is a family of algorithms
for distributed gossiping whose members differ in the
strategy adopted to discipline the right to transmit. That
strategy can be expressed as a permutation of the indices of
the participants. In [2] a formal model for this family of
algorithms was introduced and the performance of some of its
members was analyzed. In particular, the cited paper showed
how the structure of the permutations controlling these
algorithms translates into different requirements on the
underlying communication system, namely different amounts of
concurrent send and receive requests.

The focus of this paper is not on the functional properties
of the gossiping algorithms but rather on the non-functional
characteristics they exhibit as the adopted permutations
change. Building on our past research, here we first show
that by assigning different classes of permutations to the
participants it is possible to scale dynamically the number
of communication requests triggered by the execution of our
algorithm. In other words, this enables the expression of a
spectrum of codes, each characterized by a different
algorithmic parallelism.

Secondly, we show here how this can be used to reach an
optimal match with the contextual (that is, physical)
parallelism provided by the deployment platform and networks.
Such an optimal tuning avoids both a shortage and an excess
of algorithmic parallelism. In the former case one would
under-utilize the available resources; in the latter case one
would issue more requests than there are available
communication resources, which could lead to undesirable
conditions such as overloading of request queues and packet
collisions.

In what follows we first define our family of algorithms
and concisely recall its characteristics in Sect. 2. Next, in
Sect. 3 we introduce hybrid gossiping, a strategy to tune the
algorithmic parallelism as a function of the physical
parallelism characterizing the current context. Section 4
explains how that strategy can be used to design autonomic
evolution engines exploiting hybrid gossiping. State of the
art is then briefly summarized in Sect. 5. Our conclusions
follow in Sect. 6.

2 A Family of Gossiping Algorithms 

In this section we recall the main features of a family
of gossiping algorithms first introduced in Q. We refer
the reader to the cited paper for a thorough discussion of
the features of those algorithms. Introductions to gossiping,
which may be concisely defined as all-to-all pairwise
inter-process communication, may be found e.g. in ifTSl I^ fTTIl .
In what follows we define a formal model for this family
of algorithms and we highlight the characteristics of two of
its members.

2.1 Formal Model 

Let t represent time and N > 0 be an integer. We shall
consider a set of N + 1 communicating processes. We assume
that such processes may be uniquely identified via integers
in {0, ..., N}. Processes are deployed on processing nodes
linked together via one or more communication networks. We
shall refer to the set of nodes and networks as “the system”.
Nodes are equipped with a limited number of communication
ports. Likewise, networks provide a limited number of
independent full-duplex point-to-point communication lines.
At any time t a new communication can only be initiated if a
free port and a free line are available in the system. If
that is not the case, the requesting process is put in a wait
state. Due to resource competition, the number of available
ports and lines varies dynamically. Depending on the
available ports and lines, at any time t at most
$\mathcal{N}(t)$ send and at most $\mathcal{N}(t)$ receive
requests may be allowed to execute. Communication is
synchronous and blocking.

Processes own some local data they need to share (for
instance, to execute a voting algorithm as in Q or Q). In
order to do so, each process broadcasts its local data to all
the others through multiple consecutive send requests, and
receives the N data items owned by its fellows via multiple
consecutive receive requests. A discrete time model is
assumed: events occur at discrete time steps, and during
any time step any process can be involved in only one such
event. More specifically, on a given time step t process i
may be:

1. sending a message to process j, j ≠ i; this is
represented as $i\,S^t\,j$;

2. receiving a message from process j, j ≠ i; this is
shown as $i\,R^t\,j$;

3. blocked, waiting for messages to be received from any
process; symbol will be used to mean this case;

4. blocked, waiting for a message to be sent, i.e. for an
addressee to enter the receiving state. Symbol “r\”
will be used for this.

A slot is defined as a process’ temporal “window” one
time step long. On each given time step t, N + 1 slots are
available within the system. Process i makes use of slot t if
and only if $\exists j\,(i\,S^t\,j \lor i\,R^t\,j)$;
otherwise, process i is said to waste slot t.

By the term “run” we shall refer in what follows to the
collection of slots required to execute the above algorithm
on a given system, as well as to the values of the events
corresponding to those slots.


Let us define the following four state templates: 

WR state. A process is in state WR_j if it is waiting for
the arrival of a message from process j. Where the
subscript is not important it will be omitted. Once in
WR, a process stays there for one or more time steps,
corresponding to the same number of actions.

S state. A process i is in state S_j when it is sending a
message to process j. After one time step i leaves state
S_j. This corresponds to one $i\,S^t\,j$ action.

WS state. A process i waiting to send process j its
message is said to be in state WS_j. Where the subscript is
not important it will be omitted. Once in WS_j, process
i stays there for one or more time steps, corresponding
to the same number of “r>” actions.

R state. When process i is receiving a message from
process j, it is said to be in state R_j. This state
also lasts one time step and corresponds to action
$i\,R^t\,j$.

Let $\mathcal{P}_1, \dots, \mathcal{P}_N$ represent a
permutation of the N integers $0, \dots, i-1, i+1, \dots, N$.
Then the above state templates can be used to compose N + 1
finite state automata as described in Algorithm 1.
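The composition performed by Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `compose_fsa`, the use of a plain list of state labels to stand for the automaton, and the label strings themselves are choices made here, not part of the formal model.

```python
from typing import List, Sequence


def compose_fsa(i: int, n: int, perm: Sequence[int]) -> List[str]:
    """Return the state sequence of the FSA run by process i.

    `perm` plays the role of the permutation P_1..P_N: it must list
    every process index in {0..n} except i, exactly once.
    """
    if sorted(perm) != [j for j in range(n + 1) if j != i]:
        raise ValueError("perm must be a permutation of {0..n} without i")

    fsa = ["START"]
    # First row: i (WR, R) couples, i.e. wait for and receive i messages.
    for _ in range(i):
        fsa += ["WR", "R"]
    # Second row: broadcast, the j-th message going to perm[j].
    for target in perm:
        fsa += [f"WS{target}", f"S{target}"]
    # Third row: N - i (WR, R) couples for the remaining messages.
    for _ in range(i + 1, n + 1):
        fsa += ["WR", "R"]
    fsa.append("STOP")
    return fsa


if __name__ == "__main__":
    # Process 1 out of N + 1 = 4 processes, identity ordering of targets.
    print(compose_fsa(1, 3, [0, 2, 3]))
```

Whatever the permutation, the resulting automaton has 4N + 2 states: START, STOP, 2N receive-related states and 2N send-related states.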

Figure 1 shows the finite state automaton that solves
distributed gossiping for process i, which we obtained by
executing Alg. 1. The first row is the condition that has to
be reached before process i is allowed to begin its
broadcast: a series of i (WR, R) couples.

Once process i has successfully received i messages, it
acquires the right to broadcast. Broadcasting is performed
according to the rule expressed in the second row of Fig. 1:
process i sends its message to its fellows in order, the j-th
message being sent to process $\mathcal{P}_j$.

The third row of Fig. 1 instructs the reception of the
remaining N − i messages, which is coded as a sequence of
N − i (WR, R) couples.

In 13 it was shown how, irrespective of the value of
$\mathcal{P}$, such FSAs implement a distributed
deadlock-free gossiping algorithm. As intuition may suggest,
the choice of which permutation to use has a deep impact on
the overall performance of the algorithm, together with the
physical characteristics of the system. In fact, different
permutations translate into different amounts of
communication parallelism. When such algorithmic parallelism
is backed up by contextual parallelism (that is, by a
sufficiently large number of independent communication ports
and lines in the system, modeled as dynamic system
$\mathcal{N}(t)$), then there is an optimal match between the
algorithm and the deployment platform.

In order to evaluate the above impact we shall make use 
of the following “quality metrics”: 


Algorithm 1: Compose the FSA solving gossiping
for process i ∈ {0, ..., N}

Input: A = (i, N, P)

Output: FSA(A)

begin
    /* emit the initial state */
    FSA(A) := START
    for j := 0 to i − 1 do
        /* operator "^" appends a new state to the FSA */
        FSA(A) ^ WR
        FSA(A) ^ R
    enddo
    for j := 1 to N do
        FSA(A) ^ WS_{P_j}
        FSA(A) ^ S_{P_j}
    enddo
    for j := i + 1 to N do
        FSA(A) ^ WR
        FSA(A) ^ R
    enddo
    /* emit the final state */
    FSA(A) ^ STOP
end.


Average slot utilization. This is the average number of
used slots per time step in a given run. It will be
indicated as $\mu_N$, or simply as $\mu$. $\mu$ can be
interpreted as the average degree of parallelism expressed
by the algorithm, hence it will be referred to also as the
“algorithmic parallelism”. $\mu$ can take any real value in
[0, N + 1].

Length. This is the number of time steps in a run. It
represents a measure of the time needed for the distributed
algorithm to complete. $\lambda_N$, or more simply
$\lambda$, will be used for lengths.

For any time step t, we shall call $\nu_t$ the number of
slots that were used during t. The $\lambda$-tuple
$\nu = (\nu_1, \dots, \nu_\lambda)$, encoding in order the
number of used slots for each time step in a run, shall be
called “utilization string.”
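Both quality metrics follow directly from the utilization string: the length is the number of its entries and the average slot utilization is their mean. A minimal Python sketch is given below; the function name `quality_metrics` and the sample string are illustrative assumptions, not taken from the run-tables of this paper.

```python
from typing import Sequence, Tuple


def quality_metrics(utilization: Sequence[int]) -> Tuple[int, float]:
    """Return (length, average slot utilization) for a utilization string.

    `utilization[t]` is the number of slots used during time step t.
    """
    length = len(utilization)
    if length == 0:
        raise ValueError("a run must span at least one time step")
    return length, sum(utilization) / length


if __name__ == "__main__":
    # Illustrative string: one send/receive pair (2 used slots) per step.
    print(quality_metrics([2, 2, 2, 2, 2, 2]))
```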

In [2] several cases of $\mathcal{P}$ were introduced and
discussed. In particular, the cited reference showed how
varying the structure of $\mathcal{P}$ produces quite
different values of $\mu$ and $\lambda$. This fact, coupled
with the physical constraints of the system as modeled by
$\mathcal{N}(t)$, determines the overall performance of our
algorithm.

In what follows we focus on two particular cases
representing the minimal and maximal algorithmic parallelism.


2.2 Identity Permutation 

As a first case, let $\mathcal{P}$ be the identity permutation:

$$\mathcal{P} = \begin{pmatrix} 0, \dots, i-1, i+1, \dots, N \\ 0, \dots, i-1, i+1, \dots, N \end{pmatrix}, \qquad (1)$$

i.e., in cycle notation IITtII , $\mathcal{P} = (0) \dots (i-1)(i+1) \dots (N)$.

This means that, once process i acquires the right to
broadcast, it first sends its message to process 0 (possibly
having to wait for it to become available), then it does the
same with process 1, and so forth up to process N, obviously
skipping itself. This is shown in Table 1 for N = 5.
In what follows we shall refer to tables such as Table 1 as
“run-tables.”

In Q it was shown how it is possible to characterize
some properties of the quality metrics of the member
corresponding to the identity permutation. Among such
properties, the following two are particularly useful here:

• The algorithm makes use of $O(N^2)$ time; more
precisely, $\lambda_N = \frac{3}{4}N^2 + \frac{5}{4}N + \frac{1}{2}\lfloor N/2 \rfloor$.

• The asymptotic value of algorithmic parallelism, that
is $\lim_{k \to \infty} \mu_k$, equals $\frac{8}{3}$.
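The two properties above can be cross-checked numerically against the figures quoted in the caption of Table 1 (length 26 and parallelism of about 2.3077 for N = 5). The sketch below relies on one observation that is derived here rather than stated in the text: a complete run delivers N(N + 1) messages, and each delivery uses one sender slot and one receiver slot, so a run uses 2N(N + 1) slots overall. The function names are illustrative.

```python
def identity_length(n: int) -> int:
    """Closed form for the run length with the identity permutation.

    3/4 N^2 + 5/4 N + 1/2 floor(N/2), evaluated in exact integer
    arithmetic (the numerator is always a multiple of 4).
    """
    return (3 * n * n + 5 * n + 2 * (n // 2)) // 4


def identity_parallelism(n: int) -> float:
    """Average slot utilization: used slots divided by run length."""
    used_slots = 2 * n * (n + 1)  # one send slot + one receive slot per message
    return used_slots / identity_length(n)


if __name__ == "__main__":
    print(identity_length(5), round(identity_parallelism(5), 4))
    # For growing N the ratio approaches 2 / (3/4) = 8/3.
    print(identity_parallelism(10**6), 8 / 3)
```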





Figure 1. The state diagram of the FSA run by process i. The first row consists of i (WR, R) couples.
$(\mathcal{P}_1, \dots, \mathcal{P}_N)$ represents a permutation of the N integers 0, ..., i − 1, i + 1, ..., N. The last row contains
N − i (WR, R) couples.


2.3 Pipelined Permutation 

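We now consider a second case, the one corresponding
to permutation

$$\mathcal{P} = \begin{pmatrix} 0, \dots, i-1, i+1, \dots, N \\ i+1, \dots, N, 0, \dots, i-1 \end{pmatrix}. \qquad (2)$$

Note how permutation (2) is equivalent to i cyclic logical
left shifts of the identity permutation.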

When $\mathcal{P}$ is as in (2), process i first sends its
message to process i + 1, then to process i + 2, and so on
until it reaches process N. After that, i wraps around and
sends to processes 0 through i − 1. This is shown in Table 2
for N = 8. As can be seen from that table, permutation (2)
maximally overlaps the processes’ broadcast sessions, in the
same way as machine instructions are overlapped in pipelined
microprocessors 03 . This similarity led to the name of
“pipelined permutation” for (2).

In [2] some of the quality metrics of the member
corresponding to the pipelined permutation were also
characterized. In particular it was shown that:

• The algorithm makes use of $O(N)$ time, and more
precisely $\lambda_N = 3N$.

• Algorithmic parallelism depends linearly on the
number of involved processes:
$\forall k > 0: \mu_k = \frac{2}{3}(k+1)$.
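The figures reported for both permutations can be reproduced with a small discrete-time simulation. The sketch below is not the authors' simulator. It makes two modeling assumptions that the text does not spell out: ports and lines are unlimited, and when several senders target the same waiting receiver in one time step, the sender with the lowest index is served first. All function names are illustrative.

```python
from typing import Callable, List


def identity_perm(i: int, n: int) -> List[int]:
    """Targets 0..N in increasing order, skipping i."""
    return [j for j in range(n + 1) if j != i]


def pipelined_perm(i: int, n: int) -> List[int]:
    """Identity permutation after i cyclic left shifts: i+1..N, 0..i-1."""
    base = identity_perm(i, n)
    return base[i:] + base[:i]


def simulate(n: int, perm_of: Callable[[int, int], List[int]]) -> List[int]:
    """Run the gossiping FSAs and return the utilization string."""
    targets = [perm_of(i, n) for i in range(n + 1)]
    received = [0] * (n + 1)
    sent = [0] * (n + 1)
    utilization: List[int] = []
    while min(received) < n:
        # Snapshot of the states at the beginning of the time step.
        senders = [i for i in range(n + 1)
                   if received[i] >= i and sent[i] < n]
        waiting = {i for i in range(n + 1)
                   if i not in senders and received[i] < n}
        pairs = []
        for s in senders:  # assumption: lowest index wins a conflict
            dest = targets[s][sent[s]]
            if dest in waiting:
                waiting.remove(dest)
                pairs.append((s, dest))
        if not pairs:
            raise RuntimeError("no process can make progress")
        for s, dest in pairs:
            sent[s] += 1
            received[dest] += 1
        utilization.append(2 * len(pairs))  # one send + one receive slot
    return utilization


if __name__ == "__main__":
    for name, perm, n in (("identity", identity_perm, 5),
                          ("pipelined", pipelined_perm, 8)):
        v = simulate(n, perm)
        print(name, "N =", n, "length =", len(v), "mu =", sum(v) / len(v))
```

Under these assumptions the sketch yields length 26 for the identity permutation with N = 5 and length 24 with parallelism 6 for the pipelined permutation with N = 8, in agreement with the captions of Table 1 and Table 2.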

3 Tuning Algorithmic Parallelism through 
Hybrid Gossiping 

Table 1. Run (N = 5) for $\mathcal{P}$ equal to the identity permutation. The step row represents time steps. Id’s
identify processes, $\nu$ is the utilization string. In this case $\mu$ (that is, algorithmic parallelism) is about
2.3077 slots, and length $\lambda$ = 26. Note that, if the slot is a used one, then entry (i, t) = $\mathcal{R}_j$ of this matrix
is action $i\,\mathcal{R}^t\,j$.

Table 2. Run-table for N = 8 and the pipelined permutation. In this case algorithmic parallelism is 6
and length is 24. Efficiency (channel exploitation) is 6 slots out of 9, that is 66.67%.

The two cases introduced in the previous section represent
two “extremes” in the spectrum of possible permutation
structures: the identity permutation and the pipelined
permutation respectively translate into very low and very
high algorithmic parallelism. These emerging behaviors
characterize homogeneous gossiping, that is, gossiping in
which all processes make use of the same permutation. A
useful property of our family of algorithms is that it also
supports hybrid gossiping: in this case the gossiping
processes make use of a permutation selected (with some
predefined logic) from two or more classes, as depicted in
Fig. 2. A noteworthy assignment logic is the one that assigns
the pipelined permutation to a certain percentage of the
processes and the identity permutation to the rest. By doing
so we experimentally found that the ensuing algorithmic
parallelism grows with the percentage of processes that are
assigned the pipelined permutation. In what follows we shall
use symbol “$\mathcal{H}_N$” (or, where possible without
ambiguity, “$\mathcal{H}$”) to refer to such percentage.
Figure 3 shows values of $\mu$ (algorithmic parallelism) for
up to 50 participating processes when various percentages
are used.
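Hybrid gossiping can be explored with the same kind of discrete-time simulation. The sketch below is only an illustration: the text does not say which processes receive the pipelined permutation, so giving it to the lowest-indexed ones is an assumption made here, as are the unlimited number of channels, the lowest-index-first conflict rule, and all function names.

```python
from typing import Callable, List


def identity_perm(i: int, n: int) -> List[int]:
    return [j for j in range(n + 1) if j != i]


def pipelined_perm(i: int, n: int) -> List[int]:
    base = identity_perm(i, n)
    return base[i:] + base[:i]


def hybrid_assignment(n: int, share: float) -> Callable[[int, int], List[int]]:
    """Pipelined permutation for a share of the processes, identity for
    the rest. ASSUMPTION: the lowest-indexed processes are the ones that
    get the pipelined permutation."""
    count = round(share * (n + 1))

    def perm_of(i: int, n_: int) -> List[int]:
        return pipelined_perm(i, n_) if i < count else identity_perm(i, n_)

    return perm_of


def simulate(n: int, perm_of: Callable[[int, int], List[int]]) -> List[int]:
    """Return the utilization string of one run (unlimited channels)."""
    targets = [perm_of(i, n) for i in range(n + 1)]
    received, sent = [0] * (n + 1), [0] * (n + 1)
    utilization: List[int] = []
    while min(received) < n:
        senders = [i for i in range(n + 1)
                   if received[i] >= i and sent[i] < n]
        waiting = {i for i in range(n + 1)
                   if i not in senders and received[i] < n}
        pairs = []
        for s in senders:  # assumption: lowest index wins a conflict
            dest = targets[s][sent[s]]
            if dest in waiting:
                waiting.remove(dest)
                pairs.append((s, dest))
        if not pairs:
            raise RuntimeError("no process can make progress")
        for s, dest in pairs:
            sent[s] += 1
            received[dest] += 1
        utilization.append(2 * len(pairs))
    return utilization


def parallelism(n: int, share: float) -> float:
    v = simulate(n, hybrid_assignment(n, share))
    return sum(v) / len(v)


if __name__ == "__main__":
    for share in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"share = {share:.2f}  mu = {parallelism(20, share):.3f}")
```

A share of 0 reduces to homogeneous gossiping with the identity permutation and a share of 1 to the pipelined case; intermediate shares give the knob discussed in the text.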

4 Autonomic System Evolution 

The ability to tune algorithmic parallelism paves the way
to the definition of autonomic procedures to evolve the
system. In general such evolutions take the form of a
so-called “MAPE” adaptation loop HS), where MAPE stands
for “Monitor-Analyze-Plan-Execute”. In the case at hand,

Monitor signifies being able to estimate $\mathcal{N}(t)$.

Analyze means checking whether the current algorithmic
parallelism matches the estimated value of $\mathcal{N}(t)$.

Plan is a “meta-algorithm” (also known as “evolution
engine” [6]) responsible for choosing how to evolve the
system.

Execute is the execution of the meta-algorithm and the
corresponding evolution of the managed system.

Figure 2. Hybrid gossiping: processes operate using a permutation selected from two or more
classes.

In what follows we assume the availability of a monitoring
function called sense. A reflective system such as the
one introduced in Q or fS) could be used to provide
transparent access to the number of currently available
ports and lines, that is, $\mathcal{N}(t)$. The “Analyze”
step is merely the assessment of how close the current
algorithmic parallelism $\mu_N$ is to the estimated value of
$\mathcal{N}(t)$. The system would evolve only in case of two
conditions: overshooting, that is, a value of $\mu_N$
overabundant with respect to that of $\mathcal{N}(t)$, and
undershooting, namely a value of $\mu_N$ that would translate
into a sub-optimal exploitation of the available contextual
parallelism. In case the system does require adaptation,
several meta-algorithms may be selected for the Planning
step, depending on e.g. the characteristics of the mission
and the system assumptions: in fact, complex planning is
likely to call for non-negligible amounts of system
resources, which could interfere e.g. with real-time
requirements. A possible cost-effective solution for the
meta-algorithm could then be to make use of look-up tables
providing, for several values of N, the algorithmic
parallelism corresponding to some sampling of
$\mathcal{H}_N$. Figure 4 shows 200 samples of
$\mathcal{H}_{200}$, which could be computed off-line and
stored in one such look-up table. Algorithm 2 could then be
used for the “Execute” step. Alternatively, should
performance penalties be deemed preferable to the memory
penalty of storing the look-up table, one could compute at
run-time the curve best fitting the sampling of
$\mathcal{H}_N$.
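The look-up strategy of Algorithm 2 can be sketched as follows. This is a minimal illustration, not a definitive implementation: the table is represented as a mapping from percentage to algorithmic parallelism, `sense` is passed in as a hypothetical callable returning the current contextual parallelism, the sample values in the usage example are made up, and the fallback used when no entry exceeds the sensed value is an assumption that the text does not cover.

```python
from typing import Callable, Mapping


def tune(sense: Callable[[], float], table: Mapping[float, float]) -> float:
    """Return the smallest percentage h whose tabulated algorithmic
    parallelism exceeds the sensed contextual parallelism."""
    cp = sense()  # current contextual parallelism
    above = [h for h, mu in table.items() if mu > cp]
    if above:
        return min(above)
    # ASSUMPTION (not in the text): if no entry exceeds cp, fall back to
    # the entry that provides the highest algorithmic parallelism.
    return max(table, key=lambda h: table[h])


if __name__ == "__main__":
    # Made-up look-up table: percentage -> algorithmic parallelism.
    lookup = {0.0: 2.7, 0.25: 9.0, 0.5: 20.0, 1.0: 134.0}
    print(tune(lambda: 13.0, lookup))
```

Taking the minimum over the qualifying percentages mirrors the selection rule of Algorithm 2 and presumes, as observed experimentally in Sect. 3, that parallelism grows with the percentage.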

Another possibility is for instance the one described in
Algorithm 3. In this case we define $\mathcal{M}_N$ as an
associative map, that is, a growing set of domain-to-value
associations that link known values of $\mathcal{H}_N$ to
corresponding known values of $\mu_N$. A system such as the
one described in a could implement such a map. The specific
difference with respect to a look-up table lies in the
dynamic nature of $\mathcal{M}_N$, as it is possible to add
new associations to it at all times. It is assumed that
$\mathcal{M}_N$ is initialized at least with the associations
corresponding to the identity and the pipelined permutations.

The strategy followed in this case is to return a best
match with the entries currently available in
$\mathcal{M}_N$, and to add new entries to refine the list
dichotomically. The current linear strategy may be replaced
by a more efficient one designed after the non-linear nature
of $\mu$. Figure 5 graphically depicts the working of
Algorithm 3 under the following conditions: N = 200;
$\mathcal{M}_{200}$ initially includes only the identity and
pipelined permutations; $\mathcal{N}(t) = 13$ for all values
of t. At the eighth iteration $|\mathcal{M}_{200}| = 9$ and
the selected best value for $\mu$ is 13.11.
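Since Algorithm 3 is only described in words here, the following Python sketch is a loose interpretation of that description rather than a transcription of it. It assumes that algorithmic parallelism grows monotonically with the percentage, that the map initially brackets the sensed contextual parallelism, and that a hypothetical callable `measure` returns the parallelism obtained for a given percentage (for instance by running or simulating the algorithm). All names, and the toy curve in the usage example, are illustrative.

```python
from typing import Callable, Dict


def refine(assoc: Dict[float, float],
           measure: Callable[[float], float],
           cp: float,
           iterations: int) -> float:
    """Grow the associative map by bisection around cp and return the
    best matching percentage (smallest known h with parallelism > cp).

    ASSUMPTIONS: measure(h) is non-decreasing in h, and assoc already
    holds one entry at or below cp and one entry above it.
    """
    for _ in range(iterations):
        below = [h for h, mu in assoc.items() if mu <= cp]
        above = [h for h, mu in assoc.items() if mu > cp]
        if not below or not above:
            break  # cp is not bracketed: nothing to refine
        mid = (max(below) + min(above)) / 2
        assoc[mid] = measure(mid)  # new association added to the map
    above = [h for h, mu in assoc.items() if mu > cp]
    if not above:
        return max(assoc, key=lambda h: assoc[h])
    return min(above)


if __name__ == "__main__":
    def toy(h: float) -> float:
        # Made-up monotone curve standing in for measured parallelism.
        return 2.0 + 100.0 * h * h

    known = {0.0: toy(0.0), 1.0: toy(1.0)}
    best = refine(known, toy, cp=13.0, iterations=8)
    print(best, known[best], len(known))
```

Each iteration halves the interval of percentages known to bracket the target, so the map stays small while the selected parallelism converges towards the sensed value from above.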

5 State of the Art 

Figure 3. Hybrid gossiping produces different amounts of algorithmic parallelism depending on the
scheduling distribution of the pipelined and identity permutations.


Algorithm 2: Tune algorithmic parallelism
after contextual parallelism

Input: N(t), N, look-up table M_N
Output: H_best

begin
    /* cp holds the current contextual parallelism */
    /* Function now returns current time */
    cp ← sense(N(now()))
    H_best ← min{h : M_N(h) > cp}
    return H_best
end.


In this section we briefly report on methods and
strategies to tune the characteristics of a software system
so as to achieve high performance, e.g. through the
expression and exploitation of algorithmic parallelism.

Concerns such as this are normally not localized in a
single physical module (e.g. a function) or logical component
(e.g. a formal parameter, as is the case for our gossiping
algorithms); on the contrary, they often span several modules
and require the joint evolution of multiple correlated
variables. In other words, the expression of parallelism is
a typical cross-cutting concern. A typical way to tackle such
concerns is through the usage of Aspect Oriented Programming
(AOP) ifT^ . In AOP the source code consists of two separate
“blocks”: the functional code dealing with the business logic
and the aspect code expressing one or more cross-cutting
concerns. The actual source code is the result of a merging
process called weaving, in which the functional code is
rearranged, instrumented, and patched according to what is
prescribed in the aspect code. This allows software systems
to be effectively evolved so as to maximize one or more
target concerns, including e.g. the expression of algorithmic
parallelism ifTSll . Widely used in industry and academia,
AOP has proved in many cases to be able to manage effectively
the complexity of software evolution.

Figure 4. Algorithmic parallelism when $\mathcal{H}_{200}$ varies from 0.5% to 100% with step 0.5%.

The general problem of guaranteeing the emergence of
certain expected features or behaviors in a system, software
or otherwise, was termed by Jen lfT4ll as “robust
evolvability”. In the cited paper the author defines
evolvability as an entity’s ability to “alter their structure
or function so as to adapt to changing circumstances” and
discusses a system’s capability to retain certain
characteristics of interest (e.g. maximizing algorithmic
parallelism) despite changes in its composition and
deployment environment. Jen calls such capability robustness:
feature persistence under specified and unforeseen
perturbations, obtained by switching among multiple strategic
options such that those changes are dynamically tolerated or
even exploited. Interestingly enough, Jen distinguishes two
classes of evolvable systems.

• Phenotypically plastic systems. Such systems retain
their structure and organization throughout adaptations
and only achieve evolution by switching among a few
preordained, structurally equivalent configurations that
depend on some internal parameter. This is the case for
our gossiping algorithms.

• Phenotypically dynamic systems, which are able to
assume different structures and organizations by mutating
the topology, the role, and the number of their
components. An example of this is given by AOP systems.

A deeper discussion and some examples of systems matching
the above classes may be found e.g. in [6].

We believe it worth mentioning here the case of FFTW,
an evolvable software system that tunes its logic so as to
maximize performance on a given target platform. FFTW
(whose name stands for “Fastest Fourier Transform in the
West”) is a code generator for Fast Fourier Transforms that
defines and assembles blocks of C code that optimally solve
FFT sub-problems on a given machine ifTOll .

Finally we observe that nowadays it is common practice to
design software for families of target platforms and to
selectively enable or disable target-specific optimizations.
One such piece of software is the mplayer video player ID,
which explicitly reports such optimizations with messages
such as “Using optimized IMDCT transform” or “Using MMX
optimized resampler.” The same software also allows the user
to request a number of threads that optimally matches the
amount of parallelism available on a multi-core target
machine.

6 Conclusions 

We presented the properties of a family of algorithms
which retain their structural characteristics while varying
their operation depending on a combinatorial parameter. We
showed how such a “phenotypically plastic” system dlSl
may be used to meet various changing requirements by
tuning dynamically the amount of algorithmic parallelism
manifested by the software. This paves the way to enabling
robust control on the emergence of several properties and
behaviors, e.g. collision avoidance, deterministic upper
bounds on energy consumption, and optimal use of the
available communication resources. Finally we showed
how several evolution engines may be adopted to achieve
autonomic context-aware selection of the members that best
match the current contextual conditions.